Introduction

This tutorial will focus on analysing updated data of the worldwide Novel Corona virus (Covid-19) pandemic.
There are several data sources available online. We will use the data collected from a range of sources by the Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE) and hosted on their GitHub repository. Note that this data is lagging 1 day behind!!

Specific instructions on using R, RStudio and the EcoCloud are available in Workshop1 Tutorial on BB.

Analyse Data in R

Start RStudio and create a new project named Workshop3 in a new folder (if you need a reminder ho to do it, check out Workshop1 Tutorial on BB).
Once RStudio restarts inside the project’s folder, create a new R script named Workshop3.R and 2 new folders, one named data for our input data and another named output for our plots.

Install Extra Packages

For this analysis we will again use some packages from the Tidyverse, but this time we load the specific packages (which are supposed to be pre-installed on EcoCloud) to try and avoid having to download the entire tidyverse. Please install the packages if you’re getting error when running the library() command. In addition we will use the plotly package for interactive plots, paletteer for custom color palettes, readxl to read MS-Excel file and highcharter and geojsonio packages to plot the data on a world map.
To install these packages, we use the install.packages('package') command, please note that the package name need to be quoted and that we only need to perform it once (if we followed step 3b above, or every time we restart the server if we didn’t), or when we want or need to update the package. Once the package was installed, we can load its functions using the library(package) command. Note that in this case we use the package name without quotes!.

# install required packages - needed only once! (comment with a # after first use)
# install.packages(c("dplyr", "ggplot2", "paletteer", "stringr", "readr","readxl", "highcharter", "countrycode", "plotly"))
# load required packages
library(dplyr)
library(readr)
library(stringr)
library(forcats)
library(readxl)
library(ggplot2)
library(paletteer)
library(plotly)
library(highcharter)
library(countrycode)

More information on R packages can be found in this tutorial.

Read Data

Now that we’ve got RStudio up and running and our packages installed and loaded, we can read data into R from our local computer or from web locations using dedicated functions specific to the file type (.csv, .txt, .xlsx, etc.). Please download the data files from Blackboard and put them in the data folder.

We will use the read_csv() command/function from the readr package (part of the tidyverse) to load the data from a file on our computer into a variable of type data frame (table). If we don’t want to use external packages, we can use the read.csv() function from base R (which will slightly change the structure of the resulting data frame).
> Note that the path to the files can be either relative or absolute to our current working directory and that we use / and not \, to make life easier, make use of RStudio’s autocompletion feature._

# read data from the 'data' folder
covid_data <- read_csv("data/csse_country_covid_data_21_03_2020.csv")
covid_daily_data <- read_excel("data/COVID-19-geographic-disbtribution-worldwide-2020-03-22.xlsx") %>% 
  rename(Country=`Countries and territories`) %>%
  mutate(Country=str_to_title(Country)) %>%
  group_by(Country) %>% arrange(Country, DateRep) %>%
  mutate(Total_cases=cumsum(Cases), Total_Deaths=cumsum(Deaths),
         iso3c=countrycode(GeoId, origin = 'iso2c', destination = 'iso3c')) %>% 
  ungroup()

# 
#          GeoId = case_when(GeoId=="UK"~"GB",
#                            GeoId=="PS"~"GZ",
#                            GeoId=="EL"~"GR",
#                            GeoId=="XK"~"KV",
#                            GeoId=="JPG11668"~"UM",
#                            TRUE~GeoId)) %>% ungroup()

Data Exploration

Let’s use built-in functions for a brief data exploration, such as head() to show the first 10 rows of the data and str() for the type of data in each column:

#explore the data frame
head(covid_data) # show first 10 rows of the data and typr of variables
## # A tibble: 6 x 5
##   `Country/Region` Date       Confirmed Deaths Recovered
##   <chr>            <date>         <dbl>  <dbl>     <dbl>
## 1 Afghanistan      2020-01-22         0      0         0
## 2 Afghanistan      2020-01-23         0      0         0
## 3 Afghanistan      2020-01-24         0      0         0
## 4 Afghanistan      2020-01-25         0      0         0
## 5 Afghanistan      2020-01-26         0      0         0
## 6 Afghanistan      2020-01-27         0      0         0
str(covid_data) # show data structure
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 9960 obs. of  5 variables:
##  $ Country/Region: chr  "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
##  $ Date          : Date, format: "2020-01-22" "2020-01-23" ...
##  $ Confirmed     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Deaths        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Recovered     : num  0 0 0 0 0 0 0 0 0 0 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   `Country/Region` = col_character(),
##   ..   Date = col_date(format = ""),
##   ..   Confirmed = col_double(),
##   ..   Deaths = col_double(),
##   ..   Recovered = col_double()
##   .. )

Descriptive Statistics

We can also produce some descriptive statistics to better understand the data and the nature of each variable. The summary() function (as can be guessed by its name) provides a quick summary of basic descriptive statistics, such as the mean, min, max and quantiles for continous numerical values.

# summary of variables in my data
summary(covid_data)
##  Country/Region          Date              Confirmed         Deaths       
##  Length:9960        Min.   :2020-01-22   Min.   :    0   Min.   :   0.00  
##  Class :character   1st Qu.:2020-02-05   1st Qu.:    0   1st Qu.:   0.00  
##  Mode  :character   Median :2020-02-20   Median :    0   Median :   0.00  
##                     Mean   :2020-02-20   Mean   :  488   Mean   :  16.83  
##                     3rd Qu.:2020-03-06   3rd Qu.:    3   3rd Qu.:   0.00  
##                     Max.   :2020-03-21   Max.   :81305   Max.   :4825.00  
##    Recovered      
##  Min.   :    0.0  
##  1st Qu.:    0.0  
##  Median :    0.0  
##  Mean   :  179.2  
##  3rd Qu.:    0.0  
##  Max.   :71857.0

We can see that most of our data contains ‘0’ (check the median of Confirmed column). Just to confirm that, let;s plot a histogram of all the confirmed cases

ggplot(covid_data, aes(x=Confirmed)) +
  geom_histogram(fill="lightskyblue") +
  theme_bw(16)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

What are the metadata columns that describe our observations?

Country 
Date

The data is evolving over a time-series, to there’s no point treating it as a random sample.

Time-series plot

Let’s look at confirmed cases data for the 15 most affected countries.

latest_data <- covid_data %>% group_by(`Country/Region`) %>% arrange(desc(Date)) %>% slice(1) %>% ungroup() 
most_affected_countries <- latest_data %>% arrange(desc(Confirmed)) %>% slice(1:10) %>% .$`Country/Region` %>% 
  c(., "Australia")

ggplot(covid_data %>% filter(`Country/Region` %in% most_affected_countries), 
       aes(x=Date, y=Confirmed, colour=fct_reorder(factor(`Country/Region`), Confirmed, .desc = TRUE))) +
  geom_line(size=1) + 
    scale_color_paletteer_d("ggsci::springfield_simpsons") +
  labs(color="Country") +
  theme_bw(16)

It’s a bit hard to figure out how the pandemic evloves currently because the large numbers in China mask the rest. How can we make it more visible?

# subset just the data from the 10 most affected countries and order them from the most affected to the least one
plot_data <- covid_data %>% filter(`Country/Region` %in% most_affected_countries) %>% 
  mutate(`Country/Region`=fct_reorder(factor(`Country/Region`), Confirmed, .desc = TRUE))
# create the plot
plot <- ggplot(plot_data, aes(x=Date, y=Confirmed, colour=`Country/Region`)) +
  geom_line(size=0.6) + scale_y_log10(labels=scales::comma) +
    scale_color_paletteer_d("ggsci::springfield_simpsons") +
  labs(color="Country") +
  theme_bw(16)
# show an interactive plot
ggplotly(plot)
## Warning: Transformation introduced infinite values in continuous y-axis

Why did we get a warning message? What can we infer from the graph (exponential increase)?

What happens when we take the log of 0??
We can see a very similar trend for most countries and while the curve is flattening in China, South Korea and slightly in Iran, it is still alarmingly steep in Italy and most European countries and the US.

We can also look at the number of daily new cases as a measure to how rapidly the virus spreads in the population.

# find the 10 most affected countries and order them from the most affected to the least one
most_affected_eu_data <- covid_daily_data %>% group_by(Country) %>% arrange(desc(DateRep)) %>% slice(1) %>% ungroup()  %>% arrange(desc(Total_cases)) %>% slice(1:10) %>% .$Country %>% c(., "Australia")

# subset the daily data and add Australia as well
plot_data <- covid_daily_data %>% filter(Country %in% most_affected_eu_data) %>% 
  mutate(Country=fct_reorder(factor(Country), Total_cases, .desc = TRUE)) %>% 
  filter(DateRep>as.Date("2020-02-01"))
# create the plot
plot <- ggplot(plot_data, aes(x=DateRep, y=Cases, colour=Country)) +
  geom_line(size=0.6) + #scale_y_log10(labels=scales::comma) +
    scale_color_paletteer_d("ggsci::springfield_simpsons") +
  labs(x="Date", y="Confirmed Cases") +
  theme_bw(16)
# show an interactive plot
ggplotly(plot)

What do you suggest influences the numbers?

Choroplet Map

Let’s visualise how what’s the current status around the globe:

# download map data
world_map <- download_map_data("custom/world-palestine-highres")
mapdata <- get_data_from_map(world_map)

# select only most current data
choroplet_data <- covid_daily_data %>%  group_by(Country) %>% arrange(desc(DateRep)) %>%
  slice(1)

# # define colors
bins <- c(0, 100, 200, 500, 1000, 2000, 5000, 10000, 20000, 50000, 100000, Inf)
cols <- as.character(paletteer_c("viridis::inferno", length(bins)+1, direction = -1))
stops <- data.frame(q=0:length(bins)/length(bins), c=cols) %>% list_parse2(.)

# plot map
  highchart(type = "map") %>%
    hc_add_series_map(map = world_map, df = choroplet_data,
                      value = "Total_cases", joinBy = c("iso-a3", "iso3c")) %>%
    hc_colorAxis(stops = stops) %>%
    hc_tooltip(useHTML=TRUE,headerFormat='',
               pointFormat = paste0('{point.Country} confirmed cases : {point.Total_cases}<br/>Deaths: {point.Total_Deaths}')) %>%
    hc_mapNavigation(enabled = TRUE) %>%
  hc_title(text = "Number of confirmed Covid 19 cases by Country",
           margin = 40, align = "left",
           style = list(color = "#2b908f", useHTML = TRUE)) %>%
  hc_subtitle(text = glue::glue("Data last updated on {format(max(latest_data$Date), '%d/%m/%Y')}"),
              align = "left",
              style = list(color = "#2b908f", fontWeight = "bold")) %>%
  hc_exporting(enabled = TRUE)

What else can we plot or investigate?

Additional Resources

Contact

Please contact me at i.bar@griffith.edu.au for any questions or comments.